Introduction to R - Session 1

This session will cover:

  • What is R and R Studio?
  • Installing and using R Packages
  • Reading in and filtering data
  • Reformatting data and basic data cleaning
  • An introduction to next session

Why learn R?

  • Makes you think in terms of a workflow which is * Repeatable to other cases * Reproducible * Easy to maintain
  • Has many uses: * Dashboarding * Automated pipelines * Interactive reporting (RMarkdown) * Great data analysis functions * Great data visualisation functions
  • Longevity: code becomes out of date very very slowly if at all
  • Very active online community who are always innovating uses of R, and can provide support through online forums
  • The breadth of what you can do: pretty much anything with data
  • Web apps (shiny) and automated reporting (RMarkdown)
  • Helpful support community
  • Transferable skill
  • Productivity - reproducible and automatable pipelines can lead to more time being spent on other areas

Preparing for the training session

The session is run in a ‘code-along’ format, therefore we recommend that you use a second screen if you have access to one or a second device. This is because it will be easiest for you to see the teacher’s screen and your own code at the same time.

You do not need to have R and R Studio installed on your computer. For the purposes of the training session we use RStudio cloud because all you need is a browser (we recommend Google Chrome or Microsoft Edge, definitely not Internet Explorer) so there is nothing to configure and no dedicated hardware, installation or annual purchase contract required. We will detail how to get R installed on your computer within Defra at the end of session 2.

You can sign up for a free “Cloud Free” account that allows you to create up to 15 projects, and use up to 15 project hours per month.

You can also run the training materials on an up to date local installation of R/RStudio (data is not offsen and is publically available or example data so you could also use your own personal machine), but there may be a few minor differences - the instructions below assume you’re using RStudio cloud.

When you’re working through the materials, we recommend that you type the commands rather than using cut-paste.

Before the training session, please make sure you have:

1) Have Google Chrome or Microsoft Edge installed on your computer.

2) Signed up for an account here.

3) Created a new project by clicking on the ‘New Project’ button, and given your project a name

Note that in RStudio Cloud you must work in a project.

RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. On RStudio Desktop you don’t have to work in projects but WE STRONGLY RECOMMEND THAT YOU DO. You can read more about using Projects here. and see here for some more helpful background here.

If you’ve used R before, especially if some time ago, you may have come across code which uses the R function setwd(“path/to/my/data”). This is not good practice and is not necessary when using projects.

Likewise previous R users may have come across the attach() function. Again you should not use this, it is not good practice.

RStudio Cloud may not be suitable for official work purposes so please only use it for learning and training purposes.

What is R & R Studio?

What is R?

R is a computer language and environment for data manipulation, data analysis and graphical display. Importantly R is a free, open source software which means it is accessible to all and is continually improving. There are other computer languages, notably Python, used in data science, but R is predominantely used due to it’s easy to understand syntax and availability of specific data science/science/analytical packages. And certaintly within Defra the majority of data scientist/analysts will be using R.

For more information on R vs Python you can see this article in towards datascience.

What is RStudio?

RStudio is an integrated development environment (IDE) for R. It is a user interface for the R programming language making writing and running code much easier and more intuitive, as well as providing a host of addins to make your working life easier including: a console, syntax-highlighting editor that supports direct code execution (basically, colours your code and allows a user to select code to run on the fly), tools for plotting, history, debugging and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro.

Getting the training data into RStudio Cloud

Once you have the zip file of training materials, start up RStudio Cloud and go to your new project that you created above. In the bottom right window, under the Files tab, you’ll see a button called Upload. Click on that, and follow the instructions to upload your zipped file. RStudio will automatically unzip the files and place them in your project. This includes the training documents themselves, which don’t need to be available to R, but it’s no harm to have them there.

Getting set up

Although we can type R commands into the console window (bottom left - useful for quick code snippets which we don’t want to save), it’s better to type commands into a script which we can save for future use. From a script we can run an individual line of code, a block of code or the entire script.

To set up a new blank script, click on the little plus sign in the top left of your screen underneath File.

It’s good practice to put a name, a date and author in scripts.

# Intro to R part 1
# Date: the current date
# Author: name

You can use # to add notes throughout your script, the # means R knows not to try and run the line as code. Except for the install.packages() command below, type the other commands below into your script. You can save your script regularly.

Note: all code in this session is shown in a grey box as above.

Functions and packages

The way that we use R is to write code which utilises functions - functions being actions that we do to a piece of data, or object in R. But where do the functions come from? Packages.

R Packages are collections of functions written and tested and published to an online storage website (repository), the official repository being CRAN. For a package to be published here, it needs to pass several tests that ensure the package is following CRAN policies. In simple terms - we trust packages that come from CRAN to be reliable.

Types of R programming

When you install R onto your computer you are using something called Base R. This is a version of R without any add-ons. It contains many useful functions and an easy syntax to understand, however to increase the effectiveness of R we will want to use and install packages.

Due to R being Open Source many of the packages we can download and use have different ways of being used or written depending on authors preferences. For instance we may have to tell one function what column of a dataset to use first, then the data, and visa versa for another function from another package. This can make programming efficiently difficult. Additionally some packages may not work well with each other as functions from within packages may return an unexpected data type or data structure when transforming.

Most people who use R now are using a group of packages called the tidyverse.

The tidyverse is a collection of packages which share common philosophies and are designed to work together. They were introduced to the world by Hadley Wickham (one of the biggest names in R) and mainly support a tidy data approach to working with data (more on that later). As the packages share syntax, philosophies and are designed to work together, they use making R alot easier and more intuative, and packages are updated frequently with new features meaning that analysts can stay up to date with best practices and emerging methods.

During this introduction we will touch on Base R, but will mainly focus on using the tidyverse. On thing to take away from this is that there are many different ways of doing the same thing in R, here we choose to do things the “tidyverse” way.

Installing packages

The first time you use a package on your computer you need to install it. This gets the package from the repository (generally CRAN) and puts it onto your computer (or your project in RStudio cloud).

You can do this with the install.packages() function. Type this command into the console, not in your script, as you won’t need to re-run it if you come back to your script another time.

install.packages("tidyverse") 

Once you run the above you should never have to run it again (unless you uninstall R or delete some package libraries some how…) as the package will now be installed on your machine.

Here we are only going to install a group of packages from called tidyverse. Most of the time today we are going to use the functions built-in to R (“base R”).

Note that when using RStudio Cloud, each separate project has its own package library, so if you create another project you will need to install your required packages from scratch. This is different from running R on your local computer (or in a virtual machine) where the default is a single package library available to any project.

Loading packages

library() is the command used to load a package into a R session. At first the difference between installing a package with install.packages() and loading a package with library() can be confusing.

“a package is a like a book, a library is like a library; you use library() to check a package out of the library” - Hadley Wickham

# For example we can load a package called ggplot2
library("ggplot2") # we are back typing commands in your script now
library("magrittr")
# note no use of " " when using library command
# We could now use functions from ggplot2 to plot data

Note: to run a line of code in R you click the line of code you want to run, then press Ctrl+Enter or click the Run button.

When we use {r} library() we load all of the functions within a package, however we can also load one function from an installed package using whats called “namespacing”. Some users prefer this way of calling functions as it also identifies what package the function comes from, allowing easier interpretation of code.

# namespace example
ggplot2::ggplot()

# or

library("ggplot2")
ggplot()

# note - example uses empty functions 

Code not working?

Common causes of error:

  • no “” when needed (and vice versa)
  • misspelling
  • wrong number of ()
  • wrong case: pretty much everything in R is case sensitive

Warnings

Warnings are shown when R has encountered a problem processing some part of your commands, but can continue processing the other commands. However, the result might not be what you would expect, so you should carefully examine all variables used in the command that triggered the warning.

Errors

Errors mean R cannot continue processing your commands and returns to the command prompt showing a brief description of the error.

Session 1:

Project Brief

We have been tasked to compare the performance of the ESG indicator, “Forest area (% of land area)” between the UK and Netherlands as part of a data exploration exercise.

Installing and loading packages, reading data into R.

1.1 Install and library packages

As this is the first time we are using R we will need to install our packages. We do this by running the function install.packages.

## install packages
install.packages("tidyverse")
install.packages("janitor")
install.packages("readxl")
install.packages("readr")

A quick side-note on using lists in R… We can pass lists to most functions to repeat that function for multiple things. We use the syntax c() to do this, where within c() we provide a list, like so c(1,2,3,4,5). A good example of this is installing all the above packages using one line…

install.packages(c("tidyverse", "janitor", "readxl", "readr", "pillar")) ## and so on

Library packages

“a package is a like a book, a library is like a library; you use library() to check a package out of the library” - Hadley Wickham

We need to library our packages in order to use their contents.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
library(janitor) 
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(readxl)
library(readr)

# note here that we cannot use the list c() operator with library, it must be done one at a time. 

Use the read_xlsx function from the readxl package to read your data (stored in a .xslx file) into R.

Here we read the data in the .xlsx file into R and store it as what we call an object, or data frame - we will name this ESG_data.

A data frame is an R data structure, basically a rectangular matrix of data where each column is named. Each column can have a different data type, but unlike Excel, everything in a single column must have the same data type. Integer, decimal, character are the most common data types.

The <- symbol is called the assignment operator or just assign, it’s just the less than (<) character immediately followed by the minus sign (-). Note the spaces either side of <-, these are not essential but help readability. But there must NOT be a space between < and -. Likewise there must not be a space between read.csv and the open bracket.

KEYBOARD SHORTCUT Alt and - (press Alt and the minus sign together). This will produce the assignment operator correctly for you and save typing time.

ESG_data <- readxl::read_xlsx("../data/ESG_data.xslx")
ESG_data
## # A tibble: 16,013 × 67
##    `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
##    <chr>          <chr>          <chr>            <chr>             <dbl>  <dbl>
##  1 Arab World     ARB            Access to clean… EG.CFT.ACCS.ZS       NA   NA  
##  2 Arab World     ARB            Access to elect… EG.ELC.ACCS.ZS       NA   NA  
##  3 Arab World     ARB            Adjusted saving… NY.ADJ.DRES.GN.…     NA   NA  
##  4 Arab World     ARB            Adjusted saving… NY.ADJ.DFOR.GN.…     NA   NA  
##  5 Arab World     ARB            Agricultural la… AG.LND.AGRI.ZS       NA   27.8
##  6 Arab World     ARB            Agriculture, fo… NV.AGR.TOTL.ZS       NA   NA  
##  7 Arab World     ARB            Annual freshwat… ER.H2O.FWTL.ZS       NA   NA  
##  8 Arab World     ARB            Annualized aver… SI.SPR.PCAP.ZG       NA   NA  
##  9 Arab World     ARB            Cause of death,… SH.DTH.COMM.ZS       NA   NA  
## 10 Arab World     ARB            Children in emp… SL.TLF.0714.ZS       NA   NA  
## # … with 16,003 more rows, and 61 more variables: `1962` <dbl>, `1963` <dbl>,
## #   `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>,
## #   `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## #   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## #   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## #   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## #   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, …

Explore Data

1.2 - View data

There are various different ways to look at your data in R, try each of these and look at the differences between them.

## Just use the object name
ESG_data

## Look at the top 5 rows
head(ESG_data, 5)

## Look at the top 20 rows
head(ESG_data, 20)

## Open data in a new window (xl spreadsheet like)
# View(ESG_data) 

## Shows a column wide view of data, notice the namespace loading the glimpse function from the pillar package
pillar::glimpse(ESG_data)

## Show column names in the order they appear in your data
colnames(ESG_data)

## If we want to view or select a particular column of a dataset we can use the $ operator to view that column.
ESG_data$`Country Name`

## Another useful function is unique() which shows us unique values from columns
unique(ESG_data$`Country Name`)

When we load data into R it takes the column names in the raw data format. Usually for programming we want to standarise this as one of our first cleaning steps. When naming columns yourself it’s best to stick to upper or lower case characters, numbers (but not at the beginning of the name), the underscore and dot characters (_ and .). With this dataset, we will sort out these issues later.

Wrangling data

Data wrangling is about getting your data in the right format to be able to analyse it. The next few steps take you through common data wrangling problems and how you can solve them in R.

1.3 Deleting rows and columns

Firstly we are going to standardise our column names using a function from the janitor package called clean_names. This converts all of our column names to lower case and any spaces to “_“.

colnames(ESG_data) ## before
##  [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
##  [5] "1960"           "1961"           "1962"           "1963"          
##  [9] "1964"           "1965"           "1966"           "1967"          
## [13] "1968"           "1969"           "1970"           "1971"          
## [17] "1972"           "1973"           "1974"           "1975"          
## [21] "1976"           "1977"           "1978"           "1979"          
## [25] "1980"           "1981"           "1982"           "1983"          
## [29] "1984"           "1985"           "1986"           "1987"          
## [33] "1988"           "1989"           "1990"           "1991"          
## [37] "1992"           "1993"           "1994"           "1995"          
## [41] "1996"           "1997"           "1998"           "1999"          
## [45] "2000"           "2001"           "2002"           "2003"          
## [49] "2004"           "2005"           "2006"           "2007"          
## [53] "2008"           "2009"           "2010"           "2011"          
## [57] "2012"           "2013"           "2014"           "2015"          
## [61] "2016"           "2017"           "2018"           "2019"          
## [65] "2020"           "2050"           "...67"
ESG_data <- janitor::clean_names(ESG_data) 
colnames(ESG_data) ## after
##  [1] "country_name"   "country_code"   "indicator_name" "indicator_code"
##  [5] "x1960"          "x1961"          "x1962"          "x1963"         
##  [9] "x1964"          "x1965"          "x1966"          "x1967"         
## [13] "x1968"          "x1969"          "x1970"          "x1971"         
## [17] "x1972"          "x1973"          "x1974"          "x1975"         
## [21] "x1976"          "x1977"          "x1978"          "x1979"         
## [25] "x1980"          "x1981"          "x1982"          "x1983"         
## [29] "x1984"          "x1985"          "x1986"          "x1987"         
## [33] "x1988"          "x1989"          "x1990"          "x1991"         
## [37] "x1992"          "x1993"          "x1994"          "x1995"         
## [41] "x1996"          "x1997"          "x1998"          "x1999"         
## [45] "x2000"          "x2001"          "x2002"          "x2003"         
## [49] "x2004"          "x2005"          "x2006"          "x2007"         
## [53] "x2008"          "x2009"          "x2010"          "x2011"         
## [57] "x2012"          "x2013"          "x2014"          "x2015"         
## [61] "x2016"          "x2017"          "x2018"          "x2019"         
## [65] "x2020"          "x2050"          "x67"

Now we can experiment with removing columns and rows. In base R we can do this using the [] format. A square bracket written directly after your data = [row, column]. For example I want to see the 2nd row and the 4th column of my data x I would do the following x[2,4]. We can remove values by putting a - before the reference. A blank space is ignored.

## delete rows
ESG_data_cleaning <- ESG_data[-1, ] #square bracket = [row, column]
ESG_data_cleaning # view the data and we can see the first total row is gone
## # A tibble: 16,012 × 67
##    country_name country_code indicator_name  indicator_code  x1960  x1961  x1962
##    <chr>        <chr>        <chr>           <chr>           <dbl>  <dbl>  <dbl>
##  1 Arab World   ARB          Access to elec… EG.ELC.ACCS.ZS NA     NA     NA    
##  2 Arab World   ARB          Adjusted savin… NY.ADJ.DRES.G… NA     NA     NA    
##  3 Arab World   ARB          Adjusted savin… NY.ADJ.DFOR.G… NA     NA     NA    
##  4 Arab World   ARB          Agricultural l… AG.LND.AGRI.ZS NA     27.8   27.8  
##  5 Arab World   ARB          Agriculture, f… NV.AGR.TOTL.ZS NA     NA     NA    
##  6 Arab World   ARB          Annual freshwa… ER.H2O.FWTL.ZS NA     NA     NA    
##  7 Arab World   ARB          Annualized ave… SI.SPR.PCAP.ZG NA     NA     NA    
##  8 Arab World   ARB          Cause of death… SH.DTH.COMM.ZS NA     NA     NA    
##  9 Arab World   ARB          Children in em… SL.TLF.0714.ZS NA     NA     NA    
## 10 Arab World   ARB          CO2 emissions … EN.ATM.CO2E.PC  0.607  0.661  0.725
## # … with 16,002 more rows, and 60 more variables: x1963 <dbl>, x1964 <dbl>,
## #   x1965 <dbl>, x1966 <dbl>, x1967 <dbl>, x1968 <dbl>, x1969 <dbl>,
## #   x1970 <dbl>, x1971 <dbl>, x1972 <dbl>, x1973 <dbl>, x1974 <dbl>,
## #   x1975 <dbl>, x1976 <dbl>, x1977 <dbl>, x1978 <dbl>, x1979 <dbl>,
## #   x1980 <dbl>, x1981 <dbl>, x1982 <dbl>, x1983 <dbl>, x1984 <dbl>,
## #   x1985 <dbl>, x1986 <dbl>, x1987 <dbl>, x1988 <dbl>, x1989 <dbl>,
## #   x1990 <dbl>, x1991 <dbl>, x1992 <dbl>, x1993 <dbl>, x1994 <dbl>, …
ESG_data_cleaning <- ESG_data[-c(1:67), ] #square bracket = [row, column]

## delete cols
ESG_data_cleaning <- ESG_data[ ,-4]
ESG_data_cleaning # column 4 (indicator code) is deleted
## # A tibble: 16,013 × 66
##    country_name country_code indicator_name  x1960 x1961 x1962 x1963 x1964 x1965
##    <chr>        <chr>        <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Arab World   ARB          Access to clea…    NA  NA    NA    NA    NA    NA  
##  2 Arab World   ARB          Access to elec…    NA  NA    NA    NA    NA    NA  
##  3 Arab World   ARB          Adjusted savin…    NA  NA    NA    NA    NA    NA  
##  4 Arab World   ARB          Adjusted savin…    NA  NA    NA    NA    NA    NA  
##  5 Arab World   ARB          Agricultural l…    NA  27.8  27.8  27.8  27.8  27.9
##  6 Arab World   ARB          Agriculture, f…    NA  NA    NA    NA    NA    NA  
##  7 Arab World   ARB          Annual freshwa…    NA  NA    NA    NA    NA    NA  
##  8 Arab World   ARB          Annualized ave…    NA  NA    NA    NA    NA    NA  
##  9 Arab World   ARB          Cause of death…    NA  NA    NA    NA    NA    NA  
## 10 Arab World   ARB          Children in em…    NA  NA    NA    NA    NA    NA  
## # … with 16,003 more rows, and 57 more variables: x1966 <dbl>, x1967 <dbl>,
## #   x1968 <dbl>, x1969 <dbl>, x1970 <dbl>, x1971 <dbl>, x1972 <dbl>,
## #   x1973 <dbl>, x1974 <dbl>, x1975 <dbl>, x1976 <dbl>, x1977 <dbl>,
## #   x1978 <dbl>, x1979 <dbl>, x1980 <dbl>, x1981 <dbl>, x1982 <dbl>,
## #   x1983 <dbl>, x1984 <dbl>, x1985 <dbl>, x1986 <dbl>, x1987 <dbl>,
## #   x1988 <dbl>, x1989 <dbl>, x1990 <dbl>, x1991 <dbl>, x1992 <dbl>,
## #   x1993 <dbl>, x1994 <dbl>, x1995 <dbl>, x1996 <dbl>, x1997 <dbl>, …
## remove cols and rows in one step
ESG_data_cleaning <- ESG_data[-c(1:67), -4]

This initial concept is useful to understand, but within the real world and also tidyverse we would ordinarily filter data rather than delete it, and use a function dplyr::select to select particular columns of interest (again rather than deleting them).

Lets try selecting data we want using tidyverse. For this example we are interested in comparing two countries, and one particular indicators, so lets get the data we need to filter our dataset. Countries = United Kingdom and Netherlands. Indicators = Forest area (% of land area).

Note - It is best practice to select rows or columns by name, this can help avoid erroneous results if there are changes to the format of the data. This is one of the reasons why most prefer the tidyverse way of filtering and selecting data.

But first… a note on getting help.

read_xlsx was our first R function, but this is our first example of using an R function with multiple arguments (arguments are just anything within the brackets of the function name. Each argument other than the first is named, using the syntax argument = value. We are going to be passing two arguments to dplyr::filter. Here is the completed code:

## Filter country column first, call this variable ESG_filtered using the original data
ESG_filtered <- dplyr::filter(ESG_data, country_name == "United Kingdom" | country_name ==  "Netherlands") ## like in excel we can use the OR operator.
unique(ESG_filtered$country_name)
## [1] "Netherlands"    "United Kingdom"
  • Firstly, ESG_data is the data frame we’re working on. We’ve not named this argument by convention, but we could have said data = ESG_data.
  • Next using the country_name == "United Kingdom" argument we are saying that we want country_name only when it equals United Kingdom.
  • Taking this a step further we are going to use the | operator to signify or. Therefore the full dplyr::filter(ESG_data, country_name == "United Kingdom" | country_name == "Netherlands") translates to *filter the ESG_data data frame where the column country_name is equal to the United Kingdom or Netherlands.

Anyone using R, however experienced, is not going to remember the arguments of most of the functions they use. Fortunately it’s easy to bring up a help page for any R function:

?dplyr::filter
# or
help(filter) # and select the dplyr option
## Help on topic 'filter' was found in the following packages:
## 
##   Package               Library
##   stats                 /usr/lib/R/library
##   dplyr                 /home/oliviapodmore/R/x86_64-pc-linux-gnu-library/4.2
## 
## 
## Using the first match ...

You will see information appear in the “help” tab in the bottom right tab in R Studio.

Now, with the above in mind, lets filter our data.

## Filter country column first, call this variable ESG_filtered using the original data
ESG_filtered <- dplyr::filter(ESG_data, country_name == "United Kingdom" | country_name == "Netherlands") ## like in excel we can use the OR operator.
unique(ESG_filtered$country_name)
## [1] "Netherlands"    "United Kingdom"
## We are only interested in one indicator so lets filter the indicator column overwriting our previous ESG_filtered
ESG_filtered <- dplyr::filter(ESG_filtered, indicator_name == "Forest area (% of land area)") ## to write equals we use ==
unique(ESG_filtered$indicator_name)
## [1] "Forest area (% of land area)"
## Now if we look at our column names there are two columns that we aren't interested in...
colnames(ESG_filtered)
##  [1] "country_name"   "country_code"   "indicator_name" "indicator_code"
##  [5] "x1960"          "x1961"          "x1962"          "x1963"         
##  [9] "x1964"          "x1965"          "x1966"          "x1967"         
## [13] "x1968"          "x1969"          "x1970"          "x1971"         
## [17] "x1972"          "x1973"          "x1974"          "x1975"         
## [21] "x1976"          "x1977"          "x1978"          "x1979"         
## [25] "x1980"          "x1981"          "x1982"          "x1983"         
## [29] "x1984"          "x1985"          "x1986"          "x1987"         
## [33] "x1988"          "x1989"          "x1990"          "x1991"         
## [37] "x1992"          "x1993"          "x1994"          "x1995"         
## [41] "x1996"          "x1997"          "x1998"          "x1999"         
## [45] "x2000"          "x2001"          "x2002"          "x2003"         
## [49] "x2004"          "x2005"          "x2006"          "x2007"         
## [53] "x2008"          "x2009"          "x2010"          "x2011"         
## [57] "x2012"          "x2013"          "x2014"          "x2015"         
## [61] "x2016"          "x2017"          "x2018"          "x2019"         
## [65] "x2020"          "x2050"          "x67"

There is a “x67” column and a “x2050” which aren’t useful to us in this dataset, so we will get rid of them. We will use -c here to denote we want to select everything BUT the columns listed.

ESG_filtered <- dplyr::select(ESG_filtered, -c(x2050, x67))

colnames(ESG_filtered)
##  [1] "country_name"   "country_code"   "indicator_name" "indicator_code"
##  [5] "x1960"          "x1961"          "x1962"          "x1963"         
##  [9] "x1964"          "x1965"          "x1966"          "x1967"         
## [13] "x1968"          "x1969"          "x1970"          "x1971"         
## [17] "x1972"          "x1973"          "x1974"          "x1975"         
## [21] "x1976"          "x1977"          "x1978"          "x1979"         
## [25] "x1980"          "x1981"          "x1982"          "x1983"         
## [29] "x1984"          "x1985"          "x1986"          "x1987"         
## [33] "x1988"          "x1989"          "x1990"          "x1991"         
## [37] "x1992"          "x1993"          "x1994"          "x1995"         
## [41] "x1996"          "x1997"          "x1998"          "x1999"         
## [45] "x2000"          "x2001"          "x2002"          "x2003"         
## [49] "x2004"          "x2005"          "x2006"          "x2007"         
## [53] "x2008"          "x2009"          "x2010"          "x2011"         
## [57] "x2012"          "x2013"          "x2014"          "x2015"         
## [61] "x2016"          "x2017"          "x2018"          "x2019"         
## [65] "x2020"

We’ve kept our original data frame, ESG_data, and created a new intermediate data frame called ESG_filtered, in case anything goes wrong and we have to refer back to the original data without reloading the data again.

Tidy data

Making a dataset ‘tidy’ is part of the data wrangling process in R.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

There 3 main qualities of tidy data:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

You can read more about tidy data here. and here.

Of course it’s better if your data are tidy in the first place, but we often have to deal with “untidy” data!

1.4 Pivot Data

We use the pivot_longer() function from the tidyr package here to make our data set tidy (our data is currently in a “wide” format, and therefore messy, so to clean it we can lengthen our data to ensure each column is a variable, each row is an observation, and each cell is a value).

We are going to use a new way of telling R what columns to pivot. In our dataset we want to pivot all of the year columns, so we have told our function pivot_longer that cols = c(x1960:x2019). We are using the c() operator to select all columns between x1960 and x2019 using the : operator.

ESG_tidy <- tidyr::pivot_longer(ESG_filtered, cols = c(x1960:x2020), 
                                names_to = "year", 
                                values_to = "land_area")

# similar to copying and pasting using rows to columns in excel
ESG_tidy # take a look
## # A tibble: 122 × 6
##    country_name country_code indicator_name       indicator_code year  land_area
##    <chr>        <chr>        <chr>                <chr>          <chr>     <dbl>
##  1 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1960        NA
##  2 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1961        NA
##  3 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1962        NA
##  4 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1963        NA
##  5 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1964        NA
##  6 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1965        NA
##  7 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1966        NA
##  8 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1967        NA
##  9 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1968        NA
## 10 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS x1969        NA
## # … with 112 more rows

We create a new data frame called ESG_tidy. You’ll see that the output of pivot_longer() is a “tibble” this is just a neat type of data frame. Don’t worry about this for now, just think of it as a type of dataframe.

In this example we have used a function from a package. In many cases, packages add functionality that is simply not present in base R. This is one of many cases where we COULD have used a function in base R (in this case that’s called reshape). However in this case we use pivot_longer because it’s so much easier to use than reshape. We will come across other situations in Session 2 where we choose to use functions in packages rather than their base R equivalents.

Cleaning data

1.5 Removing values

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

Remove NAs

Actually we don’t want the NAs at all, so we need to remove these but before doing that it’s worth checking where the NAs are. Our first line of code does two things, reading from the inside the function is.na() returns a 1 if NA is found and 0 otherwise. The colSums() function is wrapped around is.na() and adds up the 1s so we can clearly see where the NAs occur. This is good practice because running the next line of code removes all NAs from the dataframe which in this instance is ok. In datasets with more columns you may have NAs in other columns but not on the same rows so applying na.omit() to a dataset may give unwelcome results. There are other functions that allow you to be more selective with handling NAs but that is outside this session.

colSums(is.na(ESG_tidy))
##   country_name   country_code indicator_name indicator_code           year 
##              0              0              0              0              0 
##      land_area 
##             68
ESG_clean <- na.omit(ESG_tidy)

## being selective with our NA removal example
ESG_clean <- dplyr::filter(ESG_tidy, is.na(ESG_tidy$land_area) == F) # we filter ESG_tidy$land_area where is.na returns false (i.e. it doesn't contain an NA value)

Removing unwanted characters

If we take a look at the year column we just made by pivoting our data we will see that it contains “x” then a year. We want to remove the x to create a numeric year column.

We can focus our removing of the “x” values by using a function from the package stringr::str_remove. This uses the sytax of “string” (where do you want to remove a pattern from), and “pattern” (what do you want to remove).

ESG_clean$year
##  [1] "x1990" "x1991" "x1992" "x1993" "x1994" "x1995" "x1996" "x1997" "x1998"
## [10] "x1999" "x2000" "x2001" "x2002" "x2003" "x2004" "x2005" "x2006" "x2007"
## [19] "x2008" "x2009" "x2010" "x2011" "x2012" "x2013" "x2014" "x2015" "x2016"
## [28] "x1990" "x1991" "x1992" "x1993" "x1994" "x1995" "x1996" "x1997" "x1998"
## [37] "x1999" "x2000" "x2001" "x2002" "x2003" "x2004" "x2005" "x2006" "x2007"
## [46] "x2008" "x2009" "x2010" "x2011" "x2012" "x2013" "x2014" "x2015" "x2016"
stringr::str_remove(ESG_clean$year, "x")
##  [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [11] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [21] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "1990" "1991" "1992"
## [31] "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002"
## [41] "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012"
## [51] "2013" "2014" "2015" "2016"

Given that we know the year column only wants to be numbers we could do a more general cleaning incase any spaces or other characters have crept in, we can do this using a bit of “regular expression” (regex) and another function from the stringr package, stringr::str_extract to extract all the numeric characters from the column, removing everything else.

ESG_clean$year <- stringr::str_extract(ESG_clean$year, "[[:digit:]]*$")
ESG_clean
## # A tibble: 54 × 6
##    country_name country_code indicator_name       indicator_code year  land_area
##    <chr>        <chr>        <chr>                <chr>          <chr>     <dbl>
##  1 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1990       10.2
##  2 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1991       10.3
##  3 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1992       10.3
##  4 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1993       10.4
##  5 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1994       10.4
##  6 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1995       10.4
##  7 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1996       10.5
##  8 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1997       10.5
##  9 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1998       10.6
## 10 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS 1999       10.6
## # … with 44 more rows

This function again uses the same syntax as the stringr::str_remove but instead of a pattern to remove, we are giving it a pattern to keep. The pattern in this case translates to: all digits ([[:digit:]]*) until the end of the string $.

Don’t worry about regular expressions/regex yet though, even for experienced R users regular expressions can be tricky but they are very powerful when dealing with text based datasets that you may encounter at some point. The above example is a simple introduction to what we can do with regex.

1.6 Data types

R has a wide variety of data types including scalars (a single piece of data), vectors - a series of data of the same type (numerical, character, logical), matrices, data frames, and lists.

R will automatically set data types when you read files into R, however, sometimes R will not do what you want so you may need to change. Data types determine how the data is stored in your computer and how analysis/visualization will treat variables. You can read more about data types in R here.

Change variable data types, from character (text) to numeric. Once we pivoted our earlier dataset we created a year column which contained (before we cleaned it) and x and then a number like 1993. When we run our dataframe (just run ESG_clean) we can see the datatype of the year column in character, the character strings which were there are why the column was labeled as a character rather than numeric. Using incorrect datatypes is probably one of the most common errors we get in programming, and will leave you with an error along the lines of function x expected numerical but received character. For instance we could pass a date column to a function as a character data type rather than a date type, or as in the example above a year column passed as a character rather than a numeric value.

Sometimes when we change a character column to numeric any cells containing characters will be flagged as NA, and you’ll get a message from R that it’s done this. This is why (in this case) we want to remove characters before changing the data type. In other cases we might be fine with just changing the datatype and ignoring the NA values, it all depends on what data you are using and the purpose.

Let’s give this a try…

ESG_clean$year <- as.numeric(ESG_clean$year) # Make year an integer
ESG_clean$land_area <- as.numeric(ESG_clean$land_area) # Make area numeric

And with all the above steps done we should have a nice clean dataset with which we are ready to work on some analysis and visualisation.

ESG_clean
## # A tibble: 54 × 6
##    country_name country_code indicator_name       indicator_code  year land_area
##    <chr>        <chr>        <chr>                <chr>          <dbl>     <dbl>
##  1 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1990      10.2
##  2 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1991      10.3
##  3 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1992      10.3
##  4 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1993      10.4
##  5 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1994      10.4
##  6 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1995      10.4
##  7 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1996      10.5
##  8 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1997      10.5
##  9 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1998      10.6
## 10 Netherlands  NLD          Forest area (% of l… AG.LND.FRST.ZS  1999      10.6
## # … with 44 more rows

Summary of learning

Through this session we have explored (at a very high level).. * What is R and R Studio? * Installing and using R Packages * Reading in and filtering data * Reformatting data and basic data cleaning * An introduction to next session

Next week we will be covering… * Writing code using pipes - %>% * Using If / If Else statements * Joining different datasets together * Visualisation with ggplot2 * Editing visualisations * Exporting graphs and datasets

End of Session 1!

Taster of session 2…

library(magrittr)

ESG_data <- readxl::read_xlsx("../data/ESG_data.xslx") %>%
  janitor::clean_names() %>%
  dplyr::select(-c(x2050, x67)) %>%
  dplyr::filter(country_name == "United Kingdom" | country_name == "Netherlands",
                indicator_name == "Forest area (% of land area)") %>%
  tidyr::pivot_longer(cols = c(x1960:ncol(.)), names_to = "year", values_to = "land_area")

ESG_clean <- ESG_data %>%
  dplyr::filter(is.na(land_area) == F) %>% 
  dplyr::mutate(year = as.numeric(stringr::str_remove_all(year, "x")))

ggplot2::ggplot(ESG_clean, ggplot2::aes(x = year, y = land_area, color = country_name)) +
  ggplot2::geom_line()

Help

For general help:

For data organisation in spreadsheets here.

For graphs:

For outputting graphs see here.

Defra community:

Online community for help: * First try google. Most questions have been asked before. * Stack overflow